Gene-level differential expression analysis using DESeq2

Coral Host

DESeq2 package

The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models; the estimates of dispersion and logarithmic fold changes incorporate data-driven prior distributions.

Vignette: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Import libraries

library(RColorBrewer)
library(circlize)
library(pheatmap)
library(gplots) #heatmap.2()
library(ComplexHeatmap)

library(tidyverse)
library(vegan)
library(cowplot)
library(ggrepel)
library(devtools)
library(vsn)
library(dplyr)
library(plyr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(data.table)

# first time installation:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2")
# BiocManager::install("DEGreport")
# BiocManager::install("arrayQualityMetrics")
library(BiocManager)
## Warning in file(con, "r"): URL 'https://bioconductor.org/config.yaml': status
## was 'Couldn't resolve host name'
## Warning in file(con, "r"): URL 'http://bioconductor.org/config.yaml': status was
## 'Couldn't resolve host name'
library(DESeq2)
library(DEGreport)
library(arrayQualityMetrics)
library(locfit)

Set working directory

setwd("~/Documents/Rprojects/postdoc\ Rprojects/ODU_postdoc_Rprojects/Paper_Ojo_gene-expression/Siderastrea/host/")

Counts data

As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads can be assigned to gene i in sample j.

The values in the matrix should be un-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq).

The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.

Import raw data

host <- read.csv("Sid_hybridref_final_counts_Host.csv", header=TRUE, row.names=1)
dim(host)  # 19,222 genes
## [1] 19222    56
## Some genes were not detected in any samples (genes with zero reads across all samples)
host <- host[rowSums(host) > 0, ]
dim(host)  # 18,798 genes
## [1] 18798    56

Metadata (conditions)

NB! It is absolutely critical that the columns of the count matrix and the rows of the column data (information about samples) are in the same order. There should be the same number of conditions described as there are samples in your data file, and in the same order.

rownames of meta have to match the column names of counts file

meta <- read.csv("~/Documents/Rprojects/postdoc\ Rprojects/ODU_postdoc_Rprojects/Paper_Ojo_gene-expression/Siderastrea/Siderastrea_Year-2_metadata.csv", header=TRUE, row.names=1)

# Remove all samples that were transplanted to the Destination Ojo.Norte
# There is no control site for Norte Ojo
# For now, we are focused on samples transplanted to Laja Control or Laja Ojo sites
# This is consistent with analyses for the other 2 species in the experiment
# Also, group sample size is too small for Ojo.Norte destination

meta <- subset(meta, Destination_name != "Ojo.Norte")
meta$Destination_type <- revalue(meta$Destination_type, c("control" = "Control", "ojo" = "Ojo"))

meta$Origin_type <- revalue(meta$Origin_type, c("Control" = "Lagoon"))

meta <- meta %>% unite(group, Origin_type, Destination_type, sep = ".", remove = FALSE)


meta[sapply(meta, is.character)] <- lapply(meta[sapply(meta, is.character)], 
                                       as.factor)
# Colony_ID is equivalent to genotype
meta$Colony_ID <- as.factor(meta$Colony_ID)

levels(meta$group)
## [1] "Lagoon.Control" "Lagoon.Ojo"     "Ojo.Control"    "Ojo.Ojo"       
## [5] "Reef.Control"   "Reef.Ojo"

select relevant metadata

meta <- dplyr::select(meta, Colony_ID, group, Origin_name, Origin_type, Destination_name, Destination_type, pH_Destination)

Descriptive statistics

## The do.call() function produces a data frame with one col per sample, 
## transpose it to obtain one row per sample and one column per statistics.
stats.per.sample <- data.frame(t(do.call(cbind, lapply(host, summary))))

stats.per.sample$libsum <- apply(host, 2, sum) ## libsum
stats.per.sample$zeros <- apply(host==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(host)

head(stats.per.sample)
##                   Min. X1st.Qu. Median     Mean X3rd.Qu.   Max.  libsum zeros
## PC_349_C_Sid_yr2     0        0      1 37.93297        3 136872  713064  7870
## PC_350_La_Sid_yr2    0        0      0 20.49149        2 120730  385199  9494
## PC_361_N_Sid_yr2     0        0      1 44.24891        2 213556  831791  8773
## PC_362_C_Sid_yr2     0        0      1 52.56495        3 257992  988116  7901
## PC_363_La_Sid_yr2    0        0      0 56.87775        2 280455 1069188  9864
## PC_364_N_Sid_yr2     0        0      2 81.47521        6 350992 1531571  4812
##                   percent.zeros
## PC_349_C_Sid_yr2       41.86616
## PC_350_La_Sid_yr2      50.50537
## PC_361_N_Sid_yr2       46.66986
## PC_362_C_Sid_yr2       42.03107
## PC_363_La_Sid_yr2      52.47367
## PC_364_N_Sid_yr2       25.59847

Distributions

Histograms of counts per gene

  • Top: raw counts. the scale is determined by the gene with the highest count, which is apparently an outlier.
  • Middle: raw counts, with X axis truncated to 2000 in order to display a representative range despite outliers.
  • Bottom: log2-transformed counts (bottom) per gene, with a pseudocount of 1 to avoid minus infinitevalues resulting from zero counts.
par(mfrow=c(3,1))

hist(as.matrix(host), col="blue", border="white", breaks=50)

hist(as.matrix(host), col="blue", border="white",
     breaks=20000, xlim=c(0,500), main="Counts per gene",
     xlab="Counts (truncated axis)", ylab="Number of genes", 
     las=1, cex.axis=0.7)

epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(host + epsilon)), breaks=100, col="blue", border="white",
     main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes",
     las=1, cex.axis=0.7)

Boxplots of gene count distributions per sample (non-normalized log2(counts))

boxplot(log2(host + epsilon), col=meta$group, pch=".", 
        horizontal=TRUE, cex.axis=0.5, las=1, 
        xlab="log2(Counts +1)")

Density plots (log2(counts))

if(!require("affy")){
  source("http://bioconductor.org/biocLite.R")
  biocLite("affy")  
}
library(affy)

## Each curve corresponds to one sample 
plotDensity(log2(host + epsilon), lty=1, col= meta$group, lwd=2)
grid()

NB! the R function plotDensity() does not display the actual distribution of your values, but a polynomial fit. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the fit can lead to extrapolated values which can be misleading.

Filtering

The filtering of low-expression genes is a common practice in the analysis of RNA-seq data. There are several reasons for this. For the detection of differentially expressed genes (DEGs) and from a biological point of view, genes that not expressed at a biologically meaningful level in any condition are not of interest and are therefore best ignored, but also because it can increase the number of total DEGs after correction of multiple testing, improving sensitivity and the precision of DEGs after filtering (Bourgon et al., 2010).

In addition, genes/transcripts having a low read count are generally considered as artifacts or ‘noise’. From a statistical point of view, removing low count genes allows the mean-variance relationship in the data to be estimated with greater reliability (Law et al., 2016).

https://seqqc.wordpress.com/2020/02/17/removing-low-count-genes-for-rna-seq-downstream-analysis/

# how many genes have mean count >=3 
host_means_filtered <- cbind(host, means = apply(host, 1, mean))
table(host_means_filtered$means>=3)
## 
## FALSE  TRUE 
## 10798  8000
host_means_filtered <- subset(host_means_filtered, host_means_filtered$means>=3)
host_means_filtered <- host_means_filtered[, 1:56]
dim(host_means_filtered)
## [1] 8000   56

Although RNA-seq technology has improved the dynamic range of gene expression quantification, low-expression genes may be indistinguishable from sampling noise. The presence of noisy, low-expression genes can decrease the sensitivity of detecting DEGs. Thus, identification and filtering of these low-expression genes may improve DEG detection sensitivity.

Genes with very low counts across all samples provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis.

For RNA-seq summarized to counts, suggestion to filter on absolute counts, since one could make the argument that genes with low observed counts are probably not really expressed, or that their expression cannot be reliably measured. Furthermore, low count data (which usually also contains many zeros) are not really suitable for a correlation analysis. This simple approach works fine for bulk (tissue) sequencing data but be very careful about applying it to single-cell sequencing where many interesting genes may have zero counts in most cells, and relatively low counts in the rest.

Boxplots of gene count distributions per sample

boxplot(log2(host_means_filtered + epsilon), col=meta$group, pch=".", 
        horizontal=TRUE, cex.axis=0.5, las=1, 
        xlab="log2(Counts +1)")

Percentage of null counts

prop.null <- apply(host_means_filtered, 2, function(x) 100*mean(x==0))

barplot(prop.null, main="Percentage of null counts per sample", 
        horiz=TRUE, cex.names=0.5, las=1, 
        col=meta$group, ylab='Samples', xlab='% of null counts')

Note: 1 sample has >85% of null (zero) counts

prop.null[order(prop.null)]
##   R_012_C_Sid_yr2  R_013_La_Sid_yr2   R_005_N_Sid_yr2  R_025_La_Sid_yr2 
##            0.9000            1.2000            1.4500            1.5375 
##  R_004_La_Sid_yr2  R_031_La_Sid_yr2   R_020_N_Sid_yr2   R_017_N_Sid_yr2 
##            1.7500            1.9625            2.1750            2.5750 
## PC_372_La_Sid_yr2   R_009_C_Sid_yr2  R_010_La_Sid_yr2   R_015_C_Sid_yr2 
##            2.6500            2.9250            3.3375            3.4750 
##   R_011_N_Sid_yr2  R_019_La_Sid_yr2  PC_365_C_Sid_yr2  PO_334_C_Sid_yr2 
##            3.6125            3.8750            4.2750            4.8500 
##  PC_364_N_Sid_yr2  R_022_La_Sid_yr2  PC_368_C_Sid_yr2  PO_337_C_Sid_yr2 
##            5.2250            5.3125            5.7625            5.9125 
## PC_366_La_Sid_yr2   R_027_C_Sid_yr2 PO_341_La_Sid_yr2 PC_375_La_Sid_yr2 
##            5.9625            6.0750            6.6875            6.9000 
##   R_021_C_Sid_yr2  PC_371_C_Sid_yr2 PO_332_La_Sid_yr2   R_030_C_Sid_yr2 
##            7.3000            8.5125            8.8875            9.8375 
##  PO_345_N_Sid_yr2  PO_343_C_Sid_yr2   R_003_C_Sid_yr2  PO_340_C_Sid_yr2 
##            9.8500           10.4250           12.0000           12.2875 
##  PC_374_C_Sid_yr2 PO_344_La_Sid_yr2 PO_347_La_Sid_yr2  PC_362_C_Sid_yr2 
##           12.6625           12.9750           12.9875           14.3500 
##  PC_349_C_Sid_yr2  PO_348_N_Sid_yr2  PO_342_N_Sid_yr2  PO_339_N_Sid_yr2 
##           14.4750           15.1250           15.4500           15.6500 
##  PC_361_N_Sid_yr2   R_029_N_Sid_yr2 PC_350_La_Sid_yr2 PO_335_La_Sid_yr2 
##           19.9000           21.0750           21.6500           23.1750 
##  PO_331_C_Sid_yr2 PC_363_La_Sid_yr2 PO_338_La_Sid_yr2  PO_333_N_Sid_yr2 
##           23.5375           24.4000           32.5125           41.0625 
##   R_024_C_Sid_yr2  PC_370_N_Sid_yr2   R_014_N_Sid_yr2   R_018_C_Sid_yr2 
##           52.7625           52.8500           65.0000           73.4000 
##  PO_336_N_Sid_yr2  PC_373_N_Sid_yr2  PC_376_N_Sid_yr2   R_026_N_Sid_yr2 
##           78.4000           80.5750           82.0750           88.1500

Remove sample(s) with >85% of null/zero counts

drop <- c("R_026_N_Sid_yr2")

host_means_filtered <- host_means_filtered[,!(names(host_means_filtered) %in% drop)]
meta <- meta[!(row.names(meta) %in% drop),]

prop.null <- apply(host_means_filtered, 2, function(x) 100*mean(x==0))
prop.null[order(prop.null)]
##   R_012_C_Sid_yr2  R_013_La_Sid_yr2   R_005_N_Sid_yr2  R_025_La_Sid_yr2 
##            0.9000            1.2000            1.4500            1.5375 
##  R_004_La_Sid_yr2  R_031_La_Sid_yr2   R_020_N_Sid_yr2   R_017_N_Sid_yr2 
##            1.7500            1.9625            2.1750            2.5750 
## PC_372_La_Sid_yr2   R_009_C_Sid_yr2  R_010_La_Sid_yr2   R_015_C_Sid_yr2 
##            2.6500            2.9250            3.3375            3.4750 
##   R_011_N_Sid_yr2  R_019_La_Sid_yr2  PC_365_C_Sid_yr2  PO_334_C_Sid_yr2 
##            3.6125            3.8750            4.2750            4.8500 
##  PC_364_N_Sid_yr2  R_022_La_Sid_yr2  PC_368_C_Sid_yr2  PO_337_C_Sid_yr2 
##            5.2250            5.3125            5.7625            5.9125 
## PC_366_La_Sid_yr2   R_027_C_Sid_yr2 PO_341_La_Sid_yr2 PC_375_La_Sid_yr2 
##            5.9625            6.0750            6.6875            6.9000 
##   R_021_C_Sid_yr2  PC_371_C_Sid_yr2 PO_332_La_Sid_yr2   R_030_C_Sid_yr2 
##            7.3000            8.5125            8.8875            9.8375 
##  PO_345_N_Sid_yr2  PO_343_C_Sid_yr2   R_003_C_Sid_yr2  PO_340_C_Sid_yr2 
##            9.8500           10.4250           12.0000           12.2875 
##  PC_374_C_Sid_yr2 PO_344_La_Sid_yr2 PO_347_La_Sid_yr2  PC_362_C_Sid_yr2 
##           12.6625           12.9750           12.9875           14.3500 
##  PC_349_C_Sid_yr2  PO_348_N_Sid_yr2  PO_342_N_Sid_yr2  PO_339_N_Sid_yr2 
##           14.4750           15.1250           15.4500           15.6500 
##  PC_361_N_Sid_yr2   R_029_N_Sid_yr2 PC_350_La_Sid_yr2 PO_335_La_Sid_yr2 
##           19.9000           21.0750           21.6500           23.1750 
##  PO_331_C_Sid_yr2 PC_363_La_Sid_yr2 PO_338_La_Sid_yr2  PO_333_N_Sid_yr2 
##           23.5375           24.4000           32.5125           41.0625 
##   R_024_C_Sid_yr2  PC_370_N_Sid_yr2   R_014_N_Sid_yr2   R_018_C_Sid_yr2 
##           52.7625           52.8500           65.0000           73.4000 
##  PO_336_N_Sid_yr2  PC_373_N_Sid_yr2  PC_376_N_Sid_yr2 
##           78.4000           80.5750           82.0750

Removed additional outlier that was identified through below data exploration and evaluation: - smallest number of raw counts - identified as outlier in cluster dendrogram - identified as outlier in PCA - very small sizefactor (0.06066512) - very high proportion of null counts (73%)

drop <- c("R_018_C_Sid_yr2")

host_means_filtered <- host_means_filtered[,!(names(host_means_filtered) %in% drop)]
meta <- meta[!(row.names(meta) %in% drop),]
prop.null <- apply(host_means_filtered, 2, function(x) 100*mean(x==0))

barplot(prop.null, main="Percentage of null counts per sample", 
        horiz=TRUE, cex.names=0.5, las=1, 
        col=meta$group, ylab='Samples', xlab='% of null counts')

Remove Destination Ojo.Norte samples

drop <- c("PC_361_N_Sid_yr2", "PC_364_N_Sid_yr2", "PC_370_N_Sid_yr2", "PC_376_N_Sid_yr2", "PC_373_N_Sid_yr2", "PO_333_N_Sid_yr2", "PO_336_N_Sid_yr2", "PO_339_N_Sid_yr2", "PO_342_N_Sid_yr2", "PO_345_N_Sid_yr2", "PO_348_N_Sid_yr2", "R_005_N_Sid_yr2", "R_011_N_Sid_yr2", "R_014_N_Sid_yr2", "R_017_N_Sid_yr2", "R_020_N_Sid_yr2", "R_029_N_Sid_yr2")

host_means_filtered <- host_means_filtered[,!(names(host_means_filtered) %in% drop)]
meta <- meta[!(row.names(meta) %in% drop),]

host_means_filtered <- droplevels(host_means_filtered)
meta <- droplevels(meta)

data summary

meta %>%
  group_by(group) %>%
  dplyr::summarise(count = n())
## # A tibble: 6 x 2
##   group          count
## * <fct>          <int>
## 1 Lagoon.Control     6
## 2 Lagoon.Ojo         5
## 3 Ojo.Control        5
## 4 Ojo.Ojo            6
## 5 Reef.Control       8
## 6 Reef.Ojo           7

RNA-seq count distribution

To determine the appropriate statistical model, we need information about the distribution of counts.

These images illustrate some common features of RNA-seq count data, including - a low number of counts associated with a large proportion of genes, and - a long right tail due to the lack of any upper limit for expression.

# To get an idea about how RNA-seq counts are distributed, plot counts for a couple samples
p1 <- ggplot(host_means_filtered) +
  geom_histogram(aes(x = PC_349_C_Sid_yr2), stat = "bin", bins = 200) +
  xlim(-5, 500)  +
  xlab("Raw expression counts") +
  ylab("Number of genes")

p2 <- ggplot(host_means_filtered) +
  geom_histogram(aes(x = R_025_La_Sid_yr2), stat = "bin", bins = 200) +
  xlim(-5, 500)  +
  xlab("Raw expression counts") +
  ylab("Number of genes")

grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 22 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 426 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

Modeling count data

Count data is often modeled using the binomial distribution, which can give you the probability of getting a number of heads upon tossing a coin a number of times. However, not all count data can be fit with the binomial distribution. The binomial is based on discrete events and used in situations when you have a certain number of cases.

When the number of cases is very large (i.e. people who buy lottery tickets), but the probability of an event is very small (probability of winning), the Poisson distribution is used to model these types of count data. The Poisson is similar to the binomial, but is based on continuous events. Details provided by Rafael Irizarry in the EdX class.

With RNA-Seq data, a very large number of RNAs are represented and the probability of pulling out a particular transcript is very small. Thus, it would be an appropriate situation to use the Poisson distribution. However, a unique property of this distribution is that the mean == variance. Realistically, with RNA-Seq data there is always some biological variation present across the replicates (within a sample class). Genes with larger average expression levels will tend to have larger observed variances across replicates.

If the proportions of mRNA stayed exactly constant between the biological replicates for each sample class, we could expect Poisson distribution (where mean == variance). A nice description of this concept is presented by Rafael Irizarry in the EdX class. But this doesn’t happen in practice, and so the Poisson distribution is only considered appropriate for a single biological sample.

The model that fits best, given this type of variability between replicates, is the Negative Binomial (NB) model. Essentially, the NB model is a good approximation for data where the mean < variance, as is the case with RNA-Seq count data.

plot the mean versus the variance of your data

Remember for the Poisson model, mean = variance, but for Negative Binomial, mean < variance

mean_counts <- apply(host_means_filtered, 1, mean) 
variance_counts <- apply(host_means_filtered, 1, var) 
df <- data.frame(mean_counts, variance_counts)

ggplot(df) +
        geom_point(aes(x=mean_counts, y=variance_counts)) + 
        geom_line(aes(x=mean_counts, y=mean_counts, color="red")) +
        scale_y_log10() +
        scale_x_log10() +
        theme(legend.position = "none")

Note that in the above figure, the variance across replicates tends to be greater than the mean (red line), especially for genes with large mean expression levels. This is a good indication that our data do not fit the Poisson distribution and we need to account for this increase in variance using the Negative Binomial model (i.e. Poisson will underestimate variability leading to an increase in false positive DE genes).

How does the dispersion relate to our model?

To accurately model sequencing counts, we need to generate accurate estimates of within-group variation (variation between replicates of the same sample group) for each gene. With only a few (3-6) replicates per group, the estimates of variation for each gene are often unreliable (due to the large differences in dispersion for genes with similar means).

To address this problem, DESeq2 shares information across genes to generate more accurate estimates of variation based on the mean expression level of the gene using a method called ‘shrinkage’. DESeq2 assumes that genes with similar expression levels have similar dispersion.

Estimating the dispersion for each gene separately: To model the dispersion based on expression level (mean counts of replicates), the dispersion for each gene is estimated using maximum likelihood estimation. In other words, given the count values of the replicates, the most likely estimate of dispersion is calculated.

Count normalization using DESeq2

Ensure the row names of the metadata dataframe are present and in the same order as the column names of the counts dataframe.

To create the object we will need the count matrix and the metadata table as input. We will also need to specify a design formula. The design formula specifies the column(s) in the metadata table and how they should be used in the analysis.

# Check that sample names match in both files
all(colnames(host_means_filtered) %in% rownames(meta))
## [1] TRUE
all(colnames(host_means_filtered) == rownames(meta))
## [1] TRUE
# If your data did not match, you could use the match() function to rearrange them to be matching.
levels(meta$group)
## [1] "Lagoon.Control" "Lagoon.Ojo"     "Ojo.Control"    "Ojo.Ojo"       
## [5] "Reef.Control"   "Reef.Ojo"

Create DESeq2 Dataset object

# can only do 2 group comparisons (limitation of DESeq2)

dds <- DESeqDataSetFromMatrix(countData = host_means_filtered,
                              colData = meta,
                              design= ~group) 

Note that neither rlog transformation nor the VST are used by the differential expression estimation in DESeq, which always occurs on the raw count data, through generalized linear modeling which incorporates knowledge of the variance-mean dependence. The rlog transformation and VST are offered as separate functionality which can be used for visualization, clustering or other machine learning tasks.

Count data transformation

In order to test for differential expression, we operate on raw counts and use discrete distributions as described in the previous section on differential expression. However for other downstream analyses – e.g. for visualization or clustering – it is useful to work with transformed versions of the count data.

Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors.

The point of these two transformations, the VST and the rlog, is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. In particular, genes with low expression level and therefore low read counts tend to have high variance, which is not removed efficiently by the ordinary logarithmic transformation.

Both VST and rlog use the experiment-wide trend of variance over mean, in order to transform the data to remove the experiment-wide trend. Note that we do not require or desire that all the genes have exactly the same variance after transformation. Indeed, in a figure below, you will see that after the transformations the genes with the same mean do not have exactly the same standard deviations, but that the experiment-wide trend has flattened. It is those genes with row variance above the trend which will allow us to cluster samples into interesting groups.

blind transformation to experimental design (TRUE / FALSE)

If many of genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis.

Blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it attributes differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula. Note that only the fitted dispersion estimates from mean-dispersion trend line are used in the transformation (the global dependence of dispersion on mean for the entire experiment). So setting blind to FALSE is still for the most part not using the information about which samples were in which experimental group in applying the transformation.

varianceStabilizingTransformation

This function calculates a variance stabilizing transformation (VST) from the fitted dispersion mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size.

The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.

The transformed data should be approximated variance stabilized and also includes correction for size factors or normalization factors. The transformed data is on the log2 scale for large counts.

Limitations: In order to preserve normalization, the same transformation has to be used for all samples. This results in the variance stabilizition to be only approximate. The more the size factors differ, the more residual dependence of the variance on the mean will be found in the transformed data. rlog is a transformation which can perform better in these cases. As shown in the vignette, the function meanSdPlot from the package vsn can be used to see whether this is a problem.

NB! vst() is a wrapper for the varianceStabilizingTransformation() vst() provides much faster estimation of the dispersion trend used to determine the formula for the VST. The speed-up is accomplished by subsetting to a smaller number of genes in order to estimate this dispersion trend. The subset of genes is chosen deterministically, to span the range of genes’ mean normalized count. This wrapper for the VST is not blind to the experimental design: the sample covariate information is used to estimate the global trend of genes’ dispersion values over the genes’ mean normalized count. It can be made strictly blind to experimental design by first assigning a design of ~1 before running this function, or by avoiding subsetting and using varianceStabilizingTransformation.

fitType:

In case dispersions have not yet been estimated for object, this parameter is passed on to estimateDispersions().

If estimateDispersions was called with:

- fitType="parametric", a closed-form expression for the variance stabilizing transformation is used on the normalized count data. 
- fitType="local", the reciprocal of the square root of the variance of the normalized counts, as derived from the dispersion fit, is then numerically integrated, and the integral (approximated by a spline function) is evaluated for each count value in the column, yielding a transformed value.
- fitType="mean", a VST is applied for Negative Binomial distributed counts, ’k’, with a fixed dispersion, ’a’: ( 2 asinh(sqrt(a k)) - log(a) - log(4) )/log(2).

In all cases, the transformation is scaled such that for large counts, it becomes asymptotically (for large values) equal to the logarithm to base 2 of normalized counts.

transformed counts - vst Blind = TRUE

vsdBlindTrue <- DESeq2::varianceStabilizingTransformation(dds, blind = TRUE, fitType = "parametric")

boxplot(assay(vsdBlindTrue), col=meta$group)

transformed counts - vst Blind = FALSE

vsdBlindFalse <- DESeq2::varianceStabilizingTransformation(dds, blind = FALSE, fitType = "parametric")

boxplot(assay(vsdBlindFalse), col=meta$group)

rlogged transformation

Useful for various unsupervised clustering analyses.

Warning message (but not an error): “rlog() may take a few minutes with 30 or more samples, vst() is a much faster transformation”

transformed counts - rlogged Blind = TRUE

rlogged.BlindTrue = rlogTransformation(dds, blind = TRUE)

boxplot(assay(rlogged.BlindTrue), col=meta$group)

transformed counts - rlogged Blind = FALSE

rlogged.BlindFalse = rlogTransformation(dds, blind = FALSE)

boxplot(assay(rlogged.BlindFalse), col=meta$group)

Effects of transformations on the variance

The figure below plots the standard deviation of the transformed data, across samples, against the mean, using the shifted logarithm transformation, the regularized log transformation and the variance stabilizing transformation. The shifted logarithm has elevated standard deviation in the lower count range, and the regularized log to a lesser extent, while for the variance stabilized data the standard deviation is roughly constant along the whole dynamic range.

Note that the vertical axis in such plots is the square root of the variance over all samples, so including the variance due to the experimental conditions. While a flat curve of the square root of variance over the mean may seem like the goal of such transformations, this may be unreasonable in the case of datasets with many true differences due to the experimental conditions.

Plot row standard deviations versus row means

The scatterplot of these versus each other allows you to visually verify whether there is a dependence of the standard deviation (or variance) on the mean. The red line depicts the running median estimator (window-width 10%). If there is no variance-mean dependence, then the line should be approximately horizontal.

# this gives log2(n + 1)
ntd <- normTransform(dds)
meanSdPlot(assay(ntd))

meanSdPlot(assay(vsdBlindTrue))

meanSdPlot(assay(vsdBlindFalse))

meanSdPlot(assay(rlogged.BlindTrue))

meanSdPlot(assay(rlogged.BlindFalse)) 

Cluster dendrogram - rlog Blind = FALSE

dists.rlog.BlindFalse <- dist(t(assay(rlogged.BlindFalse)))
plot(hclust(dists.rlog.BlindFalse))

Cluster dendrogram - vst Blind = FALSE

dists.vsdBlindFalse <- dist(t(assay(vsdBlindFalse)))
plot(hclust(dists.vsdBlindFalse))

save transformed count data

The input file has to contain all the genes, not just differentially expressed ones. Note that you can use the resulting transformed values only for visualization and clustering (not for differential expression analysis which needs raw counts)

# save varianceStabilizingTransformation counts
saveRDS(vsdBlindFalse, file = "counts_vst.BlindFalse_Sid_Host.rds")

# later, after you already created the file, can load the existing file:
#vsdBlindFalse <- readRDS("counts_vst.BlindFalse_Sid_Host.rds")

Obtain normalized counts data

To perform the median of ratios method of normalization, DESeq2 has a single estimateSizeFactors() function that will generate size factors for us. We will use the function in the example below, but in a typical RNA-seq analysis this step is automatically performed by the DESeq2() function.

dds <- estimateSizeFactors(dds)
plot(sort(sizeFactors(dds)))

Normalization factor applied to each sample

#sort.list(sizeFactors(dds))
sizeFactors(dds)
##  PC_349_C_Sid_yr2 PC_350_La_Sid_yr2  PC_362_C_Sid_yr2 PC_363_La_Sid_yr2 
##         0.6288406         0.3499870         0.5330218         0.3268663 
##  PC_365_C_Sid_yr2 PC_366_La_Sid_yr2  PC_368_C_Sid_yr2  PC_371_C_Sid_yr2 
##         1.2246349         1.0585559         1.1701230         1.6479362 
## PC_372_La_Sid_yr2  PC_374_C_Sid_yr2 PC_375_La_Sid_yr2  PO_331_C_Sid_yr2 
##         1.5154535         0.7998800         1.0527066         0.3431146 
## PO_332_La_Sid_yr2  PO_334_C_Sid_yr2 PO_335_La_Sid_yr2  PO_337_C_Sid_yr2 
##         0.8407535         1.3141149         0.3883872         1.5377509 
## PO_338_La_Sid_yr2  PO_340_C_Sid_yr2 PO_341_La_Sid_yr2  PO_343_C_Sid_yr2 
##         0.2146360         0.8724101         0.9385663         0.9275344 
## PO_344_La_Sid_yr2 PO_347_La_Sid_yr2   R_003_C_Sid_yr2  R_004_La_Sid_yr2 
##         0.9186029         1.5227863         0.7500184         1.4906527 
##   R_009_C_Sid_yr2  R_010_La_Sid_yr2   R_012_C_Sid_yr2  R_013_La_Sid_yr2 
##         1.5810464         1.1541862         2.6203744         4.1189805 
##   R_015_C_Sid_yr2  R_019_La_Sid_yr2   R_021_C_Sid_yr2  R_022_La_Sid_yr2 
##         1.2460632         1.0793740         0.7245243         1.3759401 
##   R_024_C_Sid_yr2  R_025_La_Sid_yr2   R_027_C_Sid_yr2   R_030_C_Sid_yr2 
##         1.8817470         6.8672884         1.0980501         0.9317269 
##  R_031_La_Sid_yr2 
##         2.0511933
order(sizeFactors(dds), decreasing = TRUE)
##  [1] 34 28 27 37 33  8 25 16 22  9 24 32 14 29  5  7 26 35 30  6 11 19 36 20 21
## [26] 18 13 10 23 31  1  3 15  2 12  4 17

NOTE: DESeq2 doesn’t actually use normalized counts, rather it uses the raw counts and models the normalization inside the Generalized Linear Model (GLM).

These normalized counts will be useful for downstream visualization of results, but cannot be used as input to DESeq2 or any other tools that perform differential expression analysis which use the negative binomial model.

Normalized counts

# to retrieve the normalized counts matrix from dds
normalized_counts <- counts(dds, normalized=TRUE)

rownames(normalized_counts) <- rownames(host_means_filtered)

# Save normalized counts table as R data file for later use
saveRDS(normalized_counts, file = "counts_normalized_Sid_Host.rds")

normalized_counts <- as.data.frame.array(normalized_counts)
setDT(normalized_counts, keep.rownames = "gene")

# Save normalized data matrix for later use:
write.csv(normalized_counts, "counts_normalized_Sid_Host.csv", row.names = F)

Total number of raw counts per sample

colSums(counts(dds))[order(colSums(counts(dds)))]
## PC_350_La_Sid_yr2  PO_331_C_Sid_yr2 PO_335_La_Sid_yr2  PC_349_C_Sid_yr2 
##            380798            463600            600042            704992 
##   R_021_C_Sid_yr2   R_003_C_Sid_yr2  PO_340_C_Sid_yr2  PC_374_C_Sid_yr2 
##            887657            909951            913719            917294 
## PO_344_La_Sid_yr2  R_022_La_Sid_yr2  PC_362_C_Sid_yr2 PO_338_La_Sid_yr2 
##            975146            980686            981553           1045346 
## PC_363_La_Sid_yr2   R_024_C_Sid_yr2  PO_334_C_Sid_yr2 PO_341_La_Sid_yr2 
##           1064916           1197661           1215148           1218324 
##   R_027_C_Sid_yr2  PO_343_C_Sid_yr2  R_010_La_Sid_yr2   R_015_C_Sid_yr2 
##           1230966           1234566           1263000           1286867 
##  PO_337_C_Sid_yr2 PC_366_La_Sid_yr2  PC_368_C_Sid_yr2  R_031_La_Sid_yr2 
##           1403241           1425275           1430057           1439884 
##  PC_371_C_Sid_yr2 PO_332_La_Sid_yr2   R_030_C_Sid_yr2 PC_375_La_Sid_yr2 
##           1445940           1449021           1468620           1479401 
##  R_025_La_Sid_yr2  R_004_La_Sid_yr2   R_009_C_Sid_yr2   R_012_C_Sid_yr2 
##           1488717           1497031           1590008           1608066 
##  PC_365_C_Sid_yr2 PO_347_La_Sid_yr2 PC_372_La_Sid_yr2  R_019_La_Sid_yr2 
##           1615318           1635270           1672597           1709900 
##  R_013_La_Sid_yr2 
##           1737655

Barplot raw counts

colSums(counts(dds)) %>% barplot(col=meta$group)

Total number of normalized counts per sample

# How do the numbers correlate with the size factor?
# Now take a look at the total depth after normalization using:
#colSums(counts(dds, normalized=T))

colSums(counts(dds, normalized=T))[order(colSums(counts(dds, normalized=T)))]
##  R_025_La_Sid_yr2  R_013_La_Sid_yr2   R_012_C_Sid_yr2   R_024_C_Sid_yr2 
##          216783.8          421865.3          613677.9          636462.3 
##  R_031_La_Sid_yr2  R_022_La_Sid_yr2  PC_371_C_Sid_yr2  PO_337_C_Sid_yr2 
##          701973.8          712738.9          877424.7          912528.2 
##  PO_334_C_Sid_yr2  R_004_La_Sid_yr2   R_009_C_Sid_yr2   R_015_C_Sid_yr2 
##          924689.3         1004278.8         1005668.1         1032746.1 
##  PO_340_C_Sid_yr2 PO_344_La_Sid_yr2 PO_347_La_Sid_yr2 PC_350_La_Sid_yr2 
##         1047350.3         1061553.4         1073867.0         1088034.8 
##  R_010_La_Sid_yr2 PC_372_La_Sid_yr2   R_027_C_Sid_yr2  PC_349_C_Sid_yr2 
##         1094277.5         1103694.0         1121047.2         1121098.1 
##  PC_374_C_Sid_yr2   R_003_C_Sid_yr2  PC_368_C_Sid_yr2   R_021_C_Sid_yr2 
##         1146789.5         1213238.3         1222142.4         1225158.4 
## PO_341_La_Sid_yr2  PC_365_C_Sid_yr2  PO_343_C_Sid_yr2 PC_366_La_Sid_yr2 
##         1298069.3         1319020.0         1331019.1         1346433.4 
##  PO_331_C_Sid_yr2 PC_375_La_Sid_yr2 PO_335_La_Sid_yr2   R_030_C_Sid_yr2 
##         1351152.2         1405330.8         1544958.2         1576234.4 
##  R_019_La_Sid_yr2 PO_332_La_Sid_yr2  PC_362_C_Sid_yr2 PC_363_La_Sid_yr2 
##         1584159.0         1723479.0         1841487.6         3257955.5 
## PO_338_La_Sid_yr2 
##         4870320.2

Barplot Normalized counts

colSums(counts(dds, normalized = T)) %>% barplot(col=meta$group)

Quality Control

A useful initial step in an RNA-seq analysis is often to assess overall similarity between samples:

To explore the similarity of our samples, we will be performing sample-level QC using Principal Component Analysis (PCA) and hierarchical clustering methods. Our sample-level QC allows us to see how well our replicates cluster together, as well as, observe whether our experimental condition represents the major source of variation in the data. Performing sample-level QC can also identify any sample outliers, which may need to be explored further to determine whether they need to be removed prior to DE analysis.

When using these unsupervised clustering methods, log2-transformation of the normalized counts improves the distances/clustering for visualization. DESeq2 uses a regularized log transform (rlog) of the normalized counts for sample-level QC as it moderates the variance across the mean, improving the clustering.

Here I use varianceStablizingTransformation() transformed counts

Principal Component Analysis (PCA)

Exploratory

By default the function uses the top 500 most variable genes. You can change this by adding the ntop argument and specifying how many genes you want to use to draw the plot.

PCA - Group

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c( "group"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("group")) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = group), size = 2) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = group)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

PCA - Origin

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c("Origin_name"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("Origin_name") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = Origin_name), size = 2) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = Origin_name)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()

PCA - Destination

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c("Destination_name"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("Destination_name") ) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = Destination_name), size = 2) +
      stat_ellipse(geom = "polygon", alpha = 1/10, aes(fill = Destination_name)) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()

PCA - Colony

pcadata = DESeq2::plotPCA(vsdBlindFalse, intgroup = c("Colony_ID"), returnData = TRUE)
percentVar = round(100 * attr(pcadata, "percentVar"))
pca = prcomp(t(assay(vsdBlindFalse)), center = TRUE, scale. = FALSE)

DESeq2::plotPCA(vsdBlindFalse, returnData = TRUE, intgroup = c("Colony_ID")) %>% 
      ggplot(aes(x = PC1, y = PC2)) +
      geom_point(aes(colour = Colony_ID), size = 2) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      theme_cowplot()

Hierarchical Clustering

Since there is no built-in function for heatmaps in DESeq2 we will be using the pheatmap() function from the pheatmap package. This function requires a matrix/dataframe of numeric values as input, and so the first thing we need to is retrieve that information from the rld object:

Extract the rlog matrix from the object

rld_mat <- assay(vsdBlindFalse) 

Compute pairwise correlation values

rld_cor <- cor(rld_mat)    
#head(rld_cor)   ## check the output of cor(), make note of the rownames and colnames

Plot the correlation values as a heatmap:

pheatmap(rld_cor)

Correlation of gene expression for all pairwise combinations of samples

Hierarchical Clustering Heatmap

This figure takes overall expression and clusters samples based on euclidean distances.

# sample by distance heatmap
sampleDists <- as.matrix(dist(t(assay(vsdBlindFalse))))
heatmap.2(as.matrix(sampleDists), key=F, trace="none",
          col=colorpanel(100, "black", "white"),
          margin=c(10, 10))

Gene-level QC

In addition to examining how well the samples/replicates cluster together, there are a few more QC steps. Prior to differential expression analysis it is beneficial to omit genes that have little or no chance of being detected as differentially expressed. This will increase the power to detect differentially expressed genes. The genes omitted fall into three categories:

- Genes with zero counts in all samples
- Genes with an extreme count outlier
- Genes with a low mean normalized counts (independent filtering)

DESeq2 will perform this filtering by default; however other DE tools, such as EdgeR will not. Filtering is a necessary step, even if you are using limma-voom and/or edgeR’s quasi-likelihood methods. Be sure to follow pre-filtering steps when using other tools, as outlined in their user guides found on Bioconductor as they generally perform much better.

Differential expression analysis

based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution

Function performs a default analysis through the steps: - estimation of size factors: estimateSizeFactors - estimation of dispersion: estimateDispersions - Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest

In this step we essentially want to determine whether the mean expression levels of different sample groups are significantly different.

Wald test: the shrunken estimate of LFC is divided by its standard error, resulting in a z-statistic, which is compared to a standard normal distribution.

In DESeq2, we assume that genes of similar average expression strength have similar dispersion. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8

  • We first treat each gene separately and estimate gene-wise dispersion estimates (using maximum likelihood), which rely only on the data of each individual gene (black dots in Figure 1).
  • Next, we determine the location parameter of the distribution of these estimates; to allow for dependence on average expression strength, we fit a smooth curve, as shown by the red line in Figure 1.
  • This provides an accurate estimate for the expected dispersion value for genes of a given expression strength but does not represent deviations of individual genes from this overall trend.
  • We then shrink the gene-wise dispersion estimates toward the values predicted by the curve to obtain final dispersion values (blue arrow heads).
  • We use an empirical Bayes approach, which lets the strength of shrinkage depend (i) on an estimate of how close true dispersion values tend to be to the fit and (ii) on the degrees of freedom: as the sample size increases, the shrinkage decreases in strength, and eventually becomes negligible.
  • Our approach therefore accounts for gene-specific variation to the extent that the data provide this information, while the fitted curve aids estimation and testing in less information-rich settings.

fitType = parameteric

  • a closed-form expression for the variance stabilizing transformation (vst) is used on the normalized count data.
  • a dispersion-mean relation of the form:

    dispersion = asymptDisp + extraPois / mean

  • via a robust gamma-family GLM.
  • coefficients asymptDisp and extraPois are given in the attribute coefficients of the dispersionFunction of the object.

# Generate normalized counts
# perform the median of ratios method of normalization
# perform a Wald test in DESeq2 pairwise comparison between treatment effects

dds <- DESeq2::DESeq(dds, fitType = "parametric", test = "Wald")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 13 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
# copy here text from below:
# -- replacing outliers and refitting for 31 genes
# -- DESeq argument 'minReplicatesForReplace' = 7 
# -- original counts are preserved in counts(dds)
results.dds = results(dds)
head(results.dds)
## log2 fold change (MLE): group Reef.Ojo vs Lagoon.Control 
## Wald test p-value: group Reef.Ojo vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      3.9670995
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116     -0.2529052
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500      0.3448257
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688     -0.0899576
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677     -0.3697414
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096     -0.9537838
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.213859  3.268171
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.580922 -0.435352
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.638094  0.540400
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.749471 -0.120028
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.710034 -0.520737
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.526393 -1.811924
##                                                        pvalue      padj
##                                                     <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.00108245 0.0204451
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.66330720 0.8179578
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.58892131 0.7643316
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.90446086 0.9527099
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.60254986 0.7730806
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.06999799 0.2363910
dispersionFunction(dds)
## function (q) 
## coefs[1] + coefs[2]/q
## <bytecode: 0x7fab9b2fd638>
## <environment: 0x7fab99ddddb8>
## attr(,"coefficients")
## asymptDisp  extraPois 
##  0.2642143  1.2764044 
## attr(,"fitType")
## [1] "parametric"
## attr(,"varLogDispEsts")
## [1] 0.2631626
## attr(,"dispPriorVar")
## [1] 0.25

Fit curve to gene-wise dispersion estimates

This curve is displayed as a red line in the figure below, which plots the estimate for the expected dispersion value for genes of a given expression strength. Each black dot is a gene with an associated mean expression level and maximum likelihood estimation (MLE) of the dispersion

The curve allows for more accurate identification of differentially expressed genes when sample sizes are small, and the strength of the shrinkage for each gene depends on :

- how close gene dispersions are from the curve
- sample size (more samples = less shrinkage)

This shrinkage method is particularly important to reduce false positives in the differential expression analysis. Genes with low dispersion estimates are shrunken towards the curve, and the more accurate, higher shrunken values are output for fitting of the model and differential expression testing.

Dispersion estimates that are slightly above the curve are also shrunk toward the curve for better dispersion estimation; however, genes with extremely high dispersion values are not. This is due to the likelihood that the gene does not follow the modeling assumptions and has higher variability than others for biological or technical reasons [1]. Shrinking the values toward the curve could result in false positives, so these values are not shrunken. These genes are shown surrounded by blue circles below.

This is a good plot to examine to ensure your data is a good fit for the DESeq2 model. You expect your data to generally scatter around the curve, with the dispersion decreasing with increasing mean expression levels. If you see a cloud or different shapes, then you might want to explore your data more to see if you have contamination (mitochondrial, etc.) or outlier samples. Note how much shrinkage you get across the whole range of means in the plotDispEsts() plot for any experiment with low degrees of freedom.

Plot dispersion estimates - fitType = “parametric”

NB! “parametric” is the default, unless the fit fails, then automatically will be replaced with another fitType (if applicable, will see this in Warning message after running DESeq() function above)

# already accomplished as part of DESeq() function (above):
#dds <- DESeq2::estimateSizeFactors(dds)
#dds <- DESeq2::estimateDispersionsGeneEst(dds)
#dds <- DESeq2::estimateDispersionsFit(dds)
#dds <- DESeq2::estimateDispersionsMAP(dds)

DESeq2::plotDispEsts(dds)

head(dispersions(dds))  # parametric
## [1] 1.1498762 0.3412903 0.4131060 0.6526407 0.5770527 0.3672592

All of the intermediate values (gene-wise dispersion estimates, fitted dispersion estimates from the trended fit, etc.) are stored in mcols(dds), with information about these columns in mcols(mcols(dds)).

rownames(mcols(mcols(dds)))
##  [1] "baseMean"                                          
##  [2] "baseVar"                                           
##  [3] "allZero"                                           
##  [4] "dispGeneEst"                                       
##  [5] "dispGeneIter"                                      
##  [6] "dispFit"                                           
##  [7] "dispersion"                                        
##  [8] "dispIter"                                          
##  [9] "dispOutlier"                                       
## [10] "dispMAP"                                           
## [11] "Intercept"                                         
## [12] "group_Lagoon.Ojo_vs_Lagoon.Control"                
## [13] "group_Ojo.Control_vs_Lagoon.Control"               
## [14] "group_Ojo.Ojo_vs_Lagoon.Control"                   
## [15] "group_Reef.Control_vs_Lagoon.Control"              
## [16] "group_Reef.Ojo_vs_Lagoon.Control"                  
## [17] "SE_Intercept"                                      
## [18] "SE_group_Lagoon.Ojo_vs_Lagoon.Control"             
## [19] "SE_group_Ojo.Control_vs_Lagoon.Control"            
## [20] "SE_group_Ojo.Ojo_vs_Lagoon.Control"                
## [21] "SE_group_Reef.Control_vs_Lagoon.Control"           
## [22] "SE_group_Reef.Ojo_vs_Lagoon.Control"               
## [23] "WaldStatistic_Intercept"                           
## [24] "WaldStatistic_group_Lagoon.Ojo_vs_Lagoon.Control"  
## [25] "WaldStatistic_group_Ojo.Control_vs_Lagoon.Control" 
## [26] "WaldStatistic_group_Ojo.Ojo_vs_Lagoon.Control"     
## [27] "WaldStatistic_group_Reef.Control_vs_Lagoon.Control"
## [28] "WaldStatistic_group_Reef.Ojo_vs_Lagoon.Control"    
## [29] "WaldPvalue_Intercept"                              
## [30] "WaldPvalue_group_Lagoon.Ojo_vs_Lagoon.Control"     
## [31] "WaldPvalue_group_Ojo.Control_vs_Lagoon.Control"    
## [32] "WaldPvalue_group_Ojo.Ojo_vs_Lagoon.Control"        
## [33] "WaldPvalue_group_Reef.Control_vs_Lagoon.Control"   
## [34] "WaldPvalue_group_Reef.Ojo_vs_Lagoon.Control"       
## [35] "betaConv"                                          
## [36] "betaIter"                                          
## [37] "deviance"                                          
## [38] "maxCooks"                                          
## [39] "replace"

Distribution of residuals

dispersions(dds) %>% hist(breaks = 500)

Differential expression analysis with DESeq2: model fitting and hypothesis testing

Evaluate contrasts

# It can be useful to include the sample names in the data set object:
rownames(dds) <- rownames(host_means_filtered)
levels(dds$group)
## [1] "Lagoon.Control" "Lagoon.Ojo"     "Ojo.Control"    "Ojo.Ojo"       
## [5] "Reef.Control"   "Reef.Ojo"

Note that the ‘results’ function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha.

Pairwise contrasts

In each pairwise contrast, we should have the treatment condition first, and the control condition second in the log2 fold change (MLE) output.

Groups: “Lagoon.Control” “Lagoon.Ojo” “Ojo.Control” “Ojo.Ojo”
“Reef.Control” “Reef.Ojo”

LO_LC “Lagoon.Ojo”, “Lagoon.Control”

This contrast does not make sense “Lagoon.Ojo”, “Ojo.Control”

OO_LO “Ojo.Ojo”, “Lagoon.Ojo”

This contrast does not make sense “Lagoon.Ojo”, “Reef.Control”

LO_RO “Lagoon.Ojo”, “Reef.Ojo”

OO_OC “Ojo.Ojo”, “Ojo.Control”

low pH vs. ambient pH, but not comparative sites OO_RC “Ojo.Ojo”, “Reef.Control”

low pH vs. ambient pH, but not comparative sites OO_LC “Ojo.Ojo”, “Lagoon.Control”

OO_RO “Ojo.Ojo”, “Reef.Ojo”

RO_RC “Reef.Ojo”, “Reef.Control”

This contrast does not make sense “Reef.Ojo”, “Lagoon.Control”

This contrast does not make sense “Reef.Ojo”, “Ojo.Control”

OC_LC “Ojo.Control”, “Lagoon.Control”

OC_RC “Ojo.Control”, “Reef.Control”

LC_RC “Lagoon.Control”, “Reef.Control”

“Lagoon.Ojo”, “Lagoon.Control”

p-adj < 0.1

results_LO_LC_.1 <- results(dds, contrast = c("group", "Lagoon.Ojo", "Lagoon.Control"), alpha = 0.1)
head(results_LO_LC_.1)
## log2 fold change (MLE): group Lagoon.Ojo vs Lagoon.Control 
## Wald test p-value: group Lagoon.Ojo vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      -0.116436
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116      -0.786266
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500       0.202177
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688       0.119849
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677      -0.289529
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096      -0.277034
##                                                        lfcSE       stat
##                                                    <numeric>  <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.550440 -0.0750985
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.732360 -1.0736063
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.749912  0.2696004
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.858614  0.1395840
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.829492 -0.3490435
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.588747 -0.4705482
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  0.940136         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.282999         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.787468         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.888989         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.727057         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.637963         1

Zero differentially expressed genes

summary(results_LO_LC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 0, 0%
## LFC < 0 (down)     : 1, 0.013%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

arrange by lowest p-values

This is later used as input for PERMANOVA

### convert to dataframe (from DESeq2 object)
res_LO_LC_.1_df <- data.frame(results_LO_LC_.1)

DE_LO_LC_pvalue <- res_LO_LC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_LO_LC_pvalue)
## # A tibble: 6 x 7
##   gene                       baseMean log2FoldChange lfcSE  stat   pvalue   padj
##   <chr>                         <dbl>          <dbl> <dbl> <dbl>    <dbl>  <dbl>
## 1 Siderastrea_Mexico_Barshi…   615.            -6.16 1.32  -4.68  2.82e-6 0.0225
## 2 Siderastrea_Mexico_Barshi…    57.7           -1.36 0.360 -3.77  1.64e-4 0.656 
## 3 Siderastrea_Mexico_Barshi…    36.4           -1.81 0.498 -3.63  2.80e-4 0.747 
## 4 Siderastrea_Mexico_Barshi…    60.6            2.56 0.726  3.52  4.24e-4 0.847 
## 5 Siderastrea_Mexico_Barshi…     8.12           2.28 0.663  3.44  5.78e-4 0.924 
## 6 Siderastrea_Mexico_Barshi…     3.14           3.73 1.11   3.38  7.26e-4 0.968
# save rlogged and csv file
saveRDS(DE_LO_LC_pvalue, file = "Sid_Host_LO_LC_pval.05.rds")
write.csv(DE_LO_LC_pvalue, "Sid_Host_LO_LC_pval.05.csv")
dim(DE_LO_LC_pvalue)
## [1] 233   7

“Ojo.Ojo”, “Lagoon.Ojo”

p-adj < 0.1

results_OO_LO_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Lagoon.Ojo"), alpha = 0.1)
head(results_OO_LO_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Ojo 
## Wald test p-value: group Ojo.Ojo vs Lagoon.Ojo 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002       1.913383
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116       0.756261
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500       0.516995
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688       1.040616
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677       0.476503
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096      -0.346128
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.466266  1.304936
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.752404  1.005127
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.739801  0.698830
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.839676  1.239306
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.841679  0.566134
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.610867 -0.566617
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  0.191915  0.999866
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.314836  0.999866
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.484658  0.999866
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.215232  0.999866
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.571303  0.999866
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.570974  0.999866
summary(results_OO_LO_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2, 0.025%
## LFC < 0 (down)     : 0, 0%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_OO_LO_.1_df <- data.frame(results_OO_LO_.1)

DE_OO_LO_padj.1 <- res_OO_LO_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_LO_padj.1)
## # A tibble: 2 x 7
##   gene                      baseMean log2FoldChange lfcSE  stat   pvalue    padj
##   <chr>                        <dbl>          <dbl> <dbl> <dbl>    <dbl>   <dbl>
## 1 Siderastrea_Mexico_Barsh…     43.3           4.50 0.913  4.93  8.40e-7 0.00672
## 2 Siderastrea_Mexico_Barsh…     47.5           1.79 0.401  4.47  7.97e-6 0.0319
# save rlogged and csv file
saveRDS(DE_OO_LO_padj.1, file = "Sid_Host_DE_genes_OO_LO_padj.1.rds")
write.csv(DE_OO_LO_padj.1, "Sid_Host_DE_genes_OO_LO_padj.1.csv")

Pairwise contrast up- vs. down-regulated genes

up.results_OO_LO_.1 = row.names(results_OO_LO_.1[results_OO_LO_.1$padj<0.1 & !(is.na(results_OO_LO_.1$padj)) & results_OO_LO_.1$log2FoldChange>0,])

up.results_OO_LO_.1 <- as.data.frame(up.results_OO_LO_.1)
up.results_OO_LO_.1$DEG <- 'Up'

OO_LO_padj.1_summary <- up.results_OO_LO_.1
dim(OO_LO_padj.1_summary)
## [1] 2 2
# save file
write.csv(OO_LO_padj.1_summary, file="Sid_Host_DE_genes_OO_LO_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_LO_pvalue <- res_OO_LO_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_LO_pvalue)
## # A tibble: 6 x 7
##   gene                      baseMean log2FoldChange lfcSE  stat   pvalue    padj
##   <chr>                        <dbl>          <dbl> <dbl> <dbl>    <dbl>   <dbl>
## 1 Siderastrea_Mexico_Barsh…    43.3            4.50 0.913  4.93  8.40e-7 0.00672
## 2 Siderastrea_Mexico_Barsh…    47.5            1.79 0.401  4.47  7.97e-6 0.0319 
## 3 Siderastrea_Mexico_Barsh…    13.4           -3.36 0.838 -4.01  6.15e-5 0.164  
## 4 Siderastrea_Mexico_Barsh…     4.11          -3.97 1.03  -3.86  1.12e-4 0.189  
## 5 Siderastrea_Mexico_Barsh…    13.0            2.61 0.677  3.85  1.18e-4 0.189  
## 6 Siderastrea_Mexico_Barsh…     3.20           4.56 1.22   3.73  1.94e-4 0.227
# save rlogged and csv file
saveRDS(DE_OO_LO_pvalue, file = "Sid_Host_OO_LO_pval.05.rds")
write.csv(DE_OO_LO_pvalue, "Sid_Host_OO_LO_pval.05.csv")
dim(DE_OO_LO_pvalue)
## [1] 331   7

“Lagoon.Ojo”, “Reef.Ojo”

p-adj < 0.1

results_LO_RO_.1 <- results(dds, contrast = c("group", "Lagoon.Ojo", "Reef.Ojo"), alpha = 0.1)
head(results_LO_RO_.1)
## log2 fold change (MLE): group Lagoon.Ojo vs Reef.Ojo 
## Wald test p-value: group Lagoon.Ojo vs Reef.Ojo 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002     -4.0835352
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116     -0.5333611
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500     -0.1426492
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688      0.2098063
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677      0.0802126
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096      0.6767501
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.300770 -3.139322
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.693154 -0.769470
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.687184 -0.207585
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.807375  0.259862
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.787823  0.101815
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.569509  1.188305
##                                                        pvalue      padj
##                                                     <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.00169339        NA
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.44161462  0.702386
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.83555290  0.932972
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.79496997  0.911446
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.91890312  0.968628
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.23471328  0.528991
summary(results_LO_RO_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 174, 2.2%
## LFC < 0 (down)     : 347, 4.3%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 1705, 21%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_LO_RO_.1_df <- data.frame(results_LO_RO_.1)

DE_LO_RO_padj.1 <- res_LO_RO_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_LO_RO_padj.1)
## # A tibble: 6 x 7
##   gene                    baseMean log2FoldChange lfcSE  stat     pvalue    padj
##   <chr>                      <dbl>          <dbl> <dbl> <dbl>      <dbl>   <dbl>
## 1 Siderastrea_Mexico_Bar…     13.4           4.68 0.791  5.92    3.29e-9 2.07e-5
## 2 Siderastrea_Mexico_Bar…     55.3          -3.45 0.599 -5.77    8.05e-9 2.53e-5
## 3 Siderastrea_Mexico_Bar…    270.           -5.05 0.888 -5.69    1.30e-8 2.73e-5
## 4 Siderastrea_Mexico_Bar…    181.           -1.54 0.280 -5.50    3.88e-8 6.10e-5
## 5 Siderastrea_Mexico_Bar…    206.           -1.99 0.368 -5.42    6.13e-8 7.71e-5
## 6 Siderastrea_Mexico_Bar…     15.5          -3.81 0.709 -5.37    7.81e-8 8.19e-5
# save rlogged and csv file
saveRDS(DE_LO_RO_padj.1, file = "Sid_Host_DE_genes_LO_RO_padj.1.rds")
write.csv(DE_LO_RO_padj.1, "Sid_Host_DE_genes_LO_RO_padj.1.csv")

Pairwise contrast up- vs. down-regulated genes

up.results_LO_RO_padj.1 = row.names(results_LO_RO_.1[results_LO_RO_.1$padj<0.1 & !(is.na(results_LO_RO_.1$padj)) & results_LO_RO_.1$log2FoldChange>0,])

up.results_LO_RO_padj.1 <- as.data.frame(up.results_LO_RO_padj.1)
up.results_LO_RO_padj.1$DEG <- 'Up'

down.results_LO_RO_padj.1 = row.names(results_LO_RO_.1[results_LO_RO_.1$padj<0.1 & !(is.na(results_LO_RO_.1$padj)) & results_LO_RO_.1$log2FoldChange<0,])

down.results_LO_RO_padj.1 <- as.data.frame(down.results_LO_RO_padj.1)
down.results_LO_RO_padj.1$DEG <- 'Down'

LO_RO_padj.1_summary <- merge(up.results_LO_RO_padj.1, down.results_LO_RO_padj.1, all=T)
dim(LO_RO_padj.1_summary)
## [1] 521   3
# save file
write.csv(LO_RO_padj.1_summary, file="Sid_Host_DE_genes_LO_RO_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_LO_RO_pvalue <- res_LO_RO_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_LO_RO_pvalue)
## # A tibble: 6 x 7
##   gene                    baseMean log2FoldChange lfcSE  stat     pvalue    padj
##   <chr>                      <dbl>          <dbl> <dbl> <dbl>      <dbl>   <dbl>
## 1 Siderastrea_Mexico_Bar…     13.4           4.68 0.791  5.92    3.29e-9 2.07e-5
## 2 Siderastrea_Mexico_Bar…     55.3          -3.45 0.599 -5.77    8.05e-9 2.53e-5
## 3 Siderastrea_Mexico_Bar…    270.           -5.05 0.888 -5.69    1.30e-8 2.73e-5
## 4 Siderastrea_Mexico_Bar…    181.           -1.54 0.280 -5.50    3.88e-8 6.10e-5
## 5 Siderastrea_Mexico_Bar…    206.           -1.99 0.368 -5.42    6.13e-8 7.71e-5
## 6 Siderastrea_Mexico_Bar…     15.5          -3.81 0.709 -5.37    7.81e-8 8.19e-5
# save rlogged and csv file
saveRDS(DE_LO_RO_pvalue, file = "Sid_Host_LO_RO_pval.05.rds")
write.csv(DE_LO_RO_pvalue, "Sid_Host_LO_RO_pval.05.csv")
dim(DE_LO_RO_pvalue)
## [1] 1424    7

“Ojo.Ojo”, “Ojo.Control”

p-adj < 0.1

results_OO_OC_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Ojo.Control"), alpha = 0.1)
head(results_OO_OC_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Ojo.Control 
## Wald test p-value: group Ojo.Ojo vs Ojo.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      1.6439316
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116      0.0828093
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500     -0.9404626
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688      0.3405084
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677      0.3090180
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096     -0.2257350
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.466266  1.121169
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.688148  0.120337
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.670845 -1.401907
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.796836  0.427325
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.814527  0.379383
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.605575 -0.372761
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  0.262216         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.904217         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.160943         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.669142         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.704403         1
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.709326         1
summary(results_OO_OC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 0, 0%
## LFC < 0 (down)     : 1, 0.013%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_OO_OC_.1_df <- data.frame(results_OO_OC_.1)

DE_OO_OC_padj.1 <- res_OO_OC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_OC_padj.1)
## # A tibble: 1 x 7
##   gene                    baseMean log2FoldChange lfcSE  stat     pvalue    padj
##   <chr>                      <dbl>          <dbl> <dbl> <dbl>      <dbl>   <dbl>
## 1 Siderastrea_Mexico_Bar…     831.          -8.19  1.41 -5.81    6.08e-9 4.87e-5
# save rlogged and csv file
saveRDS(DE_OO_OC_padj.1, file = "Sid_Host_DE_genes_OO_OC_padj.1.rds")
write.csv(DE_OO_OC_padj.1, "Sid_Host_DE_genes_OO_OC_padj.1.csv")

Pairwise contrast up- vs. down-regulated genes

down.results_OO_OC_padj.1 = row.names(results_OO_OC_.1[results_OO_OC_.1$padj<0.1 & !(is.na(results_OO_OC_.1$padj)) & results_OO_OC_.1$log2FoldChange<0,])

down.results_OO_OC_padj.1 <- as.data.frame(down.results_OO_OC_padj.1)
down.results_OO_OC_padj.1$DEG <- 'Down'

OO_OC_padj.1_summary <- down.results_OO_OC_padj.1
dim(OO_OC_padj.1_summary)
## [1] 1 2
# save file
write.csv(OO_OC_padj.1_summary, file="Sid_Host_DE_genes_OO_OC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_OC_pvalue <- res_OO_OC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_OC_pvalue)
## # A tibble: 6 x 7
##   gene                    baseMean log2FoldChange lfcSE  stat     pvalue    padj
##   <chr>                      <dbl>          <dbl> <dbl> <dbl>      <dbl>   <dbl>
## 1 Siderastrea_Mexico_Bar…   831.            -8.19 1.41  -5.81    6.08e-9 4.87e-5
## 2 Siderastrea_Mexico_Bar…   983.             1.77 0.469  3.77    1.60e-4 6.10e-1
## 3 Siderastrea_Mexico_Bar…     2.74           4.44 1.21   3.66    2.53e-4 6.10e-1
## 4 Siderastrea_Mexico_Bar…    11.3            2.47 0.695  3.55    3.89e-4 6.10e-1
## 5 Siderastrea_Mexico_Bar…     4.05           3.74 1.07   3.51    4.52e-4 6.10e-1
## 6 Siderastrea_Mexico_Bar…     6.70          -2.03 0.588 -3.45    5.69e-4 6.10e-1
# save rlogged and csv file
saveRDS(DE_OO_OC_pvalue, file = "Sid_Host_OO_OC_pval.05.rds")
write.csv(DE_OO_OC_pvalue, "Sid_Host_OO_OC_pval.05.csv")
dim(DE_OO_OC_pvalue)
## [1] 291   7

“Ojo.Ojo”, “Reef.Ojo”

p-adj < 0.1

results_OO_RO_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Reef.Ojo"), alpha = 0.1)
head(results_OO_RO_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Reef.Ojo 
## Wald test p-value: group Ojo.Ojo vs Reef.Ojo 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      -2.170152
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116       0.222900
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500       0.374346
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688       1.250422
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677       0.556715
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096       0.330622
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.104320 -1.965148
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.605995  0.367825
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.626180  0.597825
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.727699  1.718323
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.724234  0.768695
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.551022  0.600017
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.0493971  0.196005
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.7130039  0.855624
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.5499570  0.756581
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.0857378  0.275692
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.4420744  0.677514
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.5484949  0.755725
summary(results_OO_RO_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 374, 4.7%
## LFC < 0 (down)     : 801, 10%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_OO_RO_.1_df <- data.frame(results_OO_RO_.1)

DE_OO_RO_padj.1 <- res_OO_RO_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_RO_padj.1)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     55.3          -4.38 0.590 -7.42 1.14e-13 9.09e-10
## 2 Siderastrea_Mexico_Bars…    206.           -2.43 0.353 -6.90 5.33e-12 1.46e- 8
## 3 Siderastrea_Mexico_Bars…    181.           -1.85 0.268 -6.89 5.48e-12 1.46e- 8
## 4 Siderastrea_Mexico_Bars… 145894.            8.28 1.21   6.82 8.92e-12 1.78e- 8
## 5 Siderastrea_Mexico_Bars…     23.2          -4.01 0.630 -6.36 2.03e-10 3.25e- 7
## 6 Siderastrea_Mexico_Bars…    312.           -1.76 0.280 -6.29 3.24e-10 3.71e- 7
# save rlogged and csv files:
saveRDS(DE_OO_RO_padj.1, file = "Sid_Host_DE_genes_OO_RO_padj.1.rds")
write.csv(DE_OO_RO_padj.1, "Sid_Host_DE_genes_OO_RO_padj.1.csv")

Pairwise contrast up- vs. down-regulated genes

up.results_OO_RO_padj.1 = row.names(results_OO_RO_.1[results_OO_RO_.1$padj<0.1 & !(is.na(results_OO_RO_.1$padj)) & results_OO_RO_.1$log2FoldChange>0,])

up.results_OO_RO_padj.1 <- as.data.frame(up.results_OO_RO_padj.1)
up.results_OO_RO_padj.1$DEG <- 'Up'

down.results_OO_RO_padj.1 = row.names(results_OO_RO_.1[results_OO_RO_.1$padj<0.1 & !(is.na(results_OO_RO_.1$padj)) & results_OO_RO_.1$log2FoldChange<0,])

down.results_OO_RO_padj.1 <- as.data.frame(down.results_OO_RO_padj.1)
down.results_OO_RO_padj.1$DEG <- 'Down'

OO_RO_padj.1_summary <- merge(up.results_OO_RO_padj.1, down.results_OO_RO_padj.1, all=T)
dim(OO_RO_padj.1_summary)
## [1] 1175    3
# save file
write.csv(OO_RO_padj.1_summary, file="Sid_Host_DE_genes_OO_RO_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_RO_pvalue <- res_OO_RO_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_RO_pvalue)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     55.3          -4.38 0.590 -7.42 1.14e-13 9.09e-10
## 2 Siderastrea_Mexico_Bars…    206.           -2.43 0.353 -6.90 5.33e-12 1.46e- 8
## 3 Siderastrea_Mexico_Bars…    181.           -1.85 0.268 -6.89 5.48e-12 1.46e- 8
## 4 Siderastrea_Mexico_Bars… 145894.            8.28 1.21   6.82 8.92e-12 1.78e- 8
## 5 Siderastrea_Mexico_Bars…     23.2          -4.01 0.630 -6.36 2.03e-10 3.25e- 7
## 6 Siderastrea_Mexico_Bars…     13.4          -4.92 0.783 -6.29 3.17e-10 3.71e- 7
# save rlogged and csv files:
saveRDS(DE_OO_RO_pvalue, file = "Sid_Host_OO_RO_pval.05.rds")
write.csv(DE_OO_RO_pvalue, "Sid_Host_OO_RO_pval.05.csv")
dim(DE_OO_RO_pvalue)
## [1] 2023    7

“Reef.Ojo”, “Reef.Control”

p-adj < 0.1

results_RO_RC_.1 <- results(dds, contrast = c("group", "Reef.Ojo", "Reef.Control"), alpha = 0.1)
head(results_RO_RC_.1)
## log2 fold change (MLE): group Reef.Ojo vs Reef.Control 
## Wald test p-value: group Reef.Ojo vs Reef.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      0.9576749
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116      0.0782555
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500     -1.3079011
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688     -0.5522638
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677      0.0275283
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096     -0.2025592
##                                                        lfcSE       stat
##                                                    <numeric>  <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  0.882472  1.0852186
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.534192  0.1464932
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.534578 -2.4466062
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.669752 -0.8245796
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.657503  0.0418679
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.494058 -0.4099908
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.2778249  0.560627
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.8835321  0.951212
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.0144208  0.120208
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.4096103  0.678110
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.9666040  0.985081
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.6818127  0.855287
summary(results_RO_RC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 436, 5.5%
## LFC < 0 (down)     : 386, 4.8%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_RO_RC_.1_df <- data.frame(results_RO_RC_.1)

DE_RO_RC_padj.1 <- res_RO_RC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_RO_RC_padj.1)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     61.5          -2.00 0.296 -6.76 1.42e-11  1.14e-7
## 2 Siderastrea_Mexico_Bars…     27.4          -2.20 0.336 -6.54 6.15e-11  2.46e-7
## 3 Siderastrea_Mexico_Bars…     70.6           1.80 0.284  6.32 2.55e-10  6.81e-7
## 4 Siderastrea_Mexico_Bars…     55.3           3.20 0.514  6.22 4.95e-10  9.89e-7
## 5 Siderastrea_Mexico_Bars…     36.0           1.83 0.322  5.69 1.26e- 8  2.01e-5
## 6 Siderastrea_Mexico_Bars…     32.8          -2.48 0.449 -5.52 3.39e- 8  4.52e-5
# save rlogged file
saveRDS(DE_RO_RC_padj.1, file = "Sid_Host_DE_genes_RO_RC_padj.1.rds")
write.csv(DE_RO_RC_padj.1, "Sid_Host_DE_genes_RO_RC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_RO_RC_padj.1 = row.names(results_RO_RC_.1[results_RO_RC_.1$padj<0.1 & !(is.na(results_RO_RC_.1$padj)) & results_RO_RC_.1$log2FoldChange>0,])

up.results_RO_RC_padj.1 <- as.data.frame(up.results_RO_RC_padj.1)
up.results_RO_RC_padj.1$DEG <- 'Up'

down.results_RO_RC_padj.1 = row.names(results_RO_RC_.1[results_RO_RC_.1$padj<0.1 & !(is.na(results_RO_RC_.1$padj)) & results_RO_RC_.1$log2FoldChange<0,])

down.results_RO_RC_padj.1 <- as.data.frame(down.results_RO_RC_padj.1)
down.results_RO_RC_padj.1$DEG <- 'Down'

RO_RC_padj.1_summary <- merge(up.results_RO_RC_padj.1, down.results_RO_RC_padj.1, all=T)
dim(RO_RC_padj.1_summary)
## [1] 822   3
# save file
write.csv(RO_RC_padj.1_summary, file="Sid_Host_DE_genes_RO_RC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_RO_RC_pvalue <- res_RO_RC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_RO_RC_pvalue)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     61.5          -2.00 0.296 -6.76 1.42e-11  1.14e-7
## 2 Siderastrea_Mexico_Bars…     27.4          -2.20 0.336 -6.54 6.15e-11  2.46e-7
## 3 Siderastrea_Mexico_Bars…     70.6           1.80 0.284  6.32 2.55e-10  6.81e-7
## 4 Siderastrea_Mexico_Bars…     55.3           3.20 0.514  6.22 4.95e-10  9.89e-7
## 5 Siderastrea_Mexico_Bars…     36.0           1.83 0.322  5.69 1.26e- 8  2.01e-5
## 6 Siderastrea_Mexico_Bars…     32.8          -2.48 0.449 -5.52 3.39e- 8  4.52e-5
# save rlogged file
saveRDS(DE_RO_RC_pvalue, file = "Sid_Host_RO_RC_pval.05.rds")
write.csv(DE_RO_RC_pvalue, "Sid_Host_RO_RC_pval.05.csv")
dim(DE_RO_RC_pvalue)
## [1] 1720    7

“Ojo.Control”, “Lagoon.Control”

p-adj < 0.1

results_OC_LC_.1 <- results(dds, contrast = c("group", "Ojo.Control", "Lagoon.Control"), alpha = 0.1)
head(results_OC_LC_.1)
## log2 fold change (MLE): group Ojo.Control vs Lagoon.Control 
## Wald test p-value: group Ojo.Control vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002       0.153016
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116      -0.112814
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500       1.659634
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688       0.819956
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677      -0.122044
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096      -0.397426
##                                                        lfcSE       stat
##                                                    <numeric>  <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.550440  0.0986919
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.666173 -0.1693470
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.681979  2.4335544
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.816767  1.0039040
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.801927 -0.1521884
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.583254 -0.6813948
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.9213829   1.00000
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.8655237   1.00000
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.0149514   0.95653
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.3154249   1.00000
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.8790383   1.00000
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.4956217   1.00000
summary(results_OC_LC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 0, 0%
## LFC < 0 (down)     : 1, 0.013%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_OC_LC_.1_df <- data.frame(results_OC_LC_.1)

DE_OC_LC_padj.1 <- res_OC_LC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OC_LC_padj.1)
## # A tibble: 1 x 7
##   gene                       baseMean log2FoldChange lfcSE  stat   pvalue   padj
##   <chr>                         <dbl>          <dbl> <dbl> <dbl>    <dbl>  <dbl>
## 1 Siderastrea_Mexico_Barshi…     13.4          -4.43 0.996 -4.45  8.46e-6 0.0677
# save rlogged and csv files
saveRDS(DE_OC_LC_padj.1, file = "Sid_Host_DE_genes_OC_LC_padj.1.rds")
write.csv(DE_OC_LC_padj.1, "Sid_Host_DE_genes_OC_LC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

down.results_OC_LC_padj.1 = row.names(results_OC_LC_.1[results_OC_LC_.1$padj<0.1 & !(is.na(results_OC_LC_.1$padj)) & results_OC_LC_.1$log2FoldChange<0,])

down.results_OC_LC_padj.1 <- as.data.frame(down.results_OC_LC_padj.1)
down.results_OC_LC_padj.1$DEG <- 'Down'

OC_LC_padj.1_summary <- down.results_OC_LC_padj.1
dim(OC_LC_padj.1_summary)
## [1] 1 2
# save file
write.csv(OC_LC_padj.1_summary, file="Sid_Host_DE_genes_OC_LC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OC_LC_pvalue <- res_OC_LC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OC_LC_pvalue)
## # A tibble: 6 x 7
##   gene                       baseMean log2FoldChange lfcSE  stat   pvalue   padj
##   <chr>                         <dbl>          <dbl> <dbl> <dbl>    <dbl>  <dbl>
## 1 Siderastrea_Mexico_Barshi…    13.4           -4.43 0.996 -4.45  8.46e-6 0.0677
## 2 Siderastrea_Mexico_Barshi…    43.3            3.50 0.889  3.94  8.22e-5 0.242 
## 3 Siderastrea_Mexico_Barshi…     3.14           4.14 1.07   3.87  1.09e-4 0.242 
## 4 Siderastrea_Mexico_Barshi…    28.5            1.67 0.435  3.84  1.21e-4 0.242 
## 5 Siderastrea_Mexico_Barshi…    43.5            2.08 0.549  3.78  1.55e-4 0.248 
## 6 Siderastrea_Mexico_Barshi…     9.53          -2.87 0.772 -3.72  2.01e-4 0.268
# save rlogged and csv files
saveRDS(DE_OC_LC_pvalue, file = "Sid_Host_OC_LC_pval.05.rds")
write.csv(DE_OC_LC_pvalue, "Sid_Host_OC_LC_pval.05.csv")
dim(DE_OC_LC_pvalue)
## [1] 336   7

“Ojo.Control”, “Reef.Control”

p-adj < 0.1

results_OC_RC_.1 <- results(dds, contrast = c("group", "Ojo.Control", "Reef.Control"), alpha = 0.1)
head(results_OC_RC_.1)
## log2 fold change (MLE): group Ojo.Control vs Reef.Control 
## Wald test p-value: group Ojo.Control vs Reef.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002    -2.85640874
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116     0.21834628
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500     0.00690698
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688     0.35764983
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677     0.27522560
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096     0.35379819
##                                                        lfcSE       stat
##                                                    <numeric>  <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.307352 -2.1848808
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.625842  0.3488841
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.586264  0.0117814
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.744292  0.4805239
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.755810  0.3641465
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.554247  0.6383407
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.0288976  0.196342
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.7271763  0.903826
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.9906001  0.996213
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.6308549  0.867257
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.7157486  0.900279
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.5232519  0.814119
summary(results_OC_RC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 206, 2.6%
## LFC < 0 (down)     : 528, 6.6%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_OC_RC_.1_df <- data.frame(results_OC_RC_.1)

DE_OC_RC_padj.1 <- res_OC_RC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OC_RC_padj.1)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     64.9          -2.91 0.382 -7.62 2.63e-14 2.11e-10
## 2 Siderastrea_Mexico_Bars…    206.           -2.57 0.362 -7.10 1.23e-12 4.94e- 9
## 3 Siderastrea_Mexico_Bars…     28.5          -4.02 0.582 -6.90 5.26e-12 1.40e- 8
## 4 Siderastrea_Mexico_Bars…    245.           -4.49 0.662 -6.79 1.14e-11 2.28e- 8
## 5 Siderastrea_Mexico_Bars…   4983.           -3.49 0.540 -6.47 9.90e-11 1.58e- 7
## 6 Siderastrea_Mexico_Bars…     17.0          -4.42 0.700 -6.31 2.78e-10 3.71e- 7
# save rlogged and csv files
saveRDS(DE_OC_RC_padj.1, file = "Sid_Host_DE_genes_OC_RC_padj.1.rds")
write.csv(DE_OC_RC_padj.1, "Sid_Host_DE_genes_OC_RC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_OC_RC_padj.1 = row.names(results_OC_RC_.1[results_OC_RC_.1$padj<0.1 & !(is.na(results_OC_RC_.1$padj)) & results_OC_RC_.1$log2FoldChange>0,])

up.results_OC_RC_padj.1 <- as.data.frame(up.results_OC_RC_padj.1)
up.results_OC_RC_padj.1$DEG <- 'Up'

down.results_OC_RC_padj.1 = row.names(results_OC_RC_.1[results_OC_RC_.1$padj<0.1 & !(is.na(results_OC_RC_.1$padj)) & results_OC_RC_.1$log2FoldChange<0,])

down.results_OC_RC_padj.1 <- as.data.frame(down.results_OC_RC_padj.1)
down.results_OC_RC_padj.1$DEG <- 'Down'

OC_RC_padj.1_summary <- merge(up.results_OC_RC_padj.1, down.results_OC_RC_padj.1, all=T)
dim(OC_RC_padj.1_summary)
## [1] 734   3
# save file
write.csv(OC_RC_padj.1_summary , file="Sid_Host_DE_genes_OC_RC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OC_RC_pvalue <- res_OC_RC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OC_RC_pvalue)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     64.9          -2.91 0.382 -7.62 2.63e-14 2.11e-10
## 2 Siderastrea_Mexico_Bars…    206.           -2.57 0.362 -7.10 1.23e-12 4.94e- 9
## 3 Siderastrea_Mexico_Bars…     28.5          -4.02 0.582 -6.90 5.26e-12 1.40e- 8
## 4 Siderastrea_Mexico_Bars…    245.           -4.49 0.662 -6.79 1.14e-11 2.28e- 8
## 5 Siderastrea_Mexico_Bars…   4983.           -3.49 0.540 -6.47 9.90e-11 1.58e- 7
## 6 Siderastrea_Mexico_Bars…     17.0          -4.42 0.700 -6.31 2.78e-10 3.71e- 7
# save rlogged and csv file
saveRDS(DE_OC_RC_pvalue, file = "Sid_Host_OC_RC_pval.05.rds")
write.csv(DE_OC_RC_pvalue, "Sid_Host_OC_RC_pval.05.csv")
dim(DE_OC_RC_pvalue)
## [1] 1492    7

“Lagoon.Control”, “Reef.Control”

p-adj < 0.1

results_LC_RC_.1 <- results(dds, contrast = c("group", "Lagoon.Control", "Reef.Control"), alpha = 0.1)
head(results_LC_RC_.1)
## log2 fold change (MLE): group Lagoon.Control vs Reef.Control 
## Wald test p-value: group Lagoon.Control vs Reef.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      -3.009425
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116       0.331161
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500      -1.652727
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688      -0.462306
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677       0.397270
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096       0.751225
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.220910 -2.464902
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.584166  0.566895
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.613122 -2.695590
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.730706 -0.632685
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.706896  0.561992
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.516116  1.455534
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002 0.0137051 0.1573204
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005 0.5707855 0.8438008
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009 0.0070264 0.0996904
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010 0.5269397 0.8253946
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011 0.5741216 0.8441350
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014 0.1455215 0.5084595
summary(results_LC_RC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 142, 1.8%
## LFC < 0 (down)     : 423, 5.3%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
### convert to dataframe (from DESeq2 object)
res_LC_RC_.1_df <- data.frame(results_LC_RC_.1)

DE_LC_RC_padj.1 <- res_LC_RC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_LC_RC_padj.1)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     64.9          -2.77 0.356 -7.78 7.13e-15 5.70e-11
## 2 Siderastrea_Mexico_Bars…     45.2          -2.82 0.434 -6.49 8.70e-11 3.48e- 7
## 3 Siderastrea_Mexico_Bars…   4983.           -3.28 0.511 -6.42 1.37e-10 3.64e- 7
## 4 Siderastrea_Mexico_Bars…    245.           -3.90 0.622 -6.27 3.60e-10 7.20e- 7
## 5 Siderastrea_Mexico_Bars…     52.5          -1.98 0.325 -6.10 1.05e- 9 1.68e- 6
## 6 Siderastrea_Mexico_Bars…     28.5          -3.02 0.506 -5.97 2.35e- 9 3.13e- 6
# save rlogged and csv file
saveRDS(DE_LC_RC_padj.1, file = "Sid_Host_DE_genes_LC_RC_padj.1.rds")
write.csv(DE_LC_RC_padj.1, "Sid_Host_DE_genes_LC_RC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_LC_RC_padj.1 = row.names(results_LC_RC_.1[results_LC_RC_.1$padj<0.1 & !(is.na(results_LC_RC_.1$padj)) & results_LC_RC_.1$log2FoldChange>0,])

up.results_LC_RC_padj.1 <- as.data.frame(up.results_LC_RC_padj.1)
up.results_LC_RC_padj.1$DEG <- 'Up'

down.results_LC_RC_padj.1 = row.names(results_LC_RC_.1[results_LC_RC_.1$padj<0.1 & !(is.na(results_LC_RC_.1$padj)) & results_LC_RC_.1$log2FoldChange<0,])

down.results_LC_RC_padj.1 <- as.data.frame(down.results_LC_RC_padj.1)
down.results_LC_RC_padj.1$DEG <- 'Down'

LC_RC_padj.1_summary <- merge(up.results_LC_RC_padj.1, down.results_LC_RC_padj.1, all=T)
dim(LC_RC_padj.1_summary)
## [1] 565   3
# save file
write.csv(LC_RC_padj.1_summary, file="Sid_Host_DE_genes_LC_RC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_LC_RC_pvalue <- res_LC_RC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_LC_RC_pvalue)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat   pvalue     padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 Siderastrea_Mexico_Bars…     64.9          -2.77 0.356 -7.78 7.13e-15 5.70e-11
## 2 Siderastrea_Mexico_Bars…     45.2          -2.82 0.434 -6.49 8.70e-11 3.48e- 7
## 3 Siderastrea_Mexico_Bars…   4983.           -3.28 0.511 -6.42 1.37e-10 3.64e- 7
## 4 Siderastrea_Mexico_Bars…    245.           -3.90 0.622 -6.27 3.60e-10 7.20e- 7
## 5 Siderastrea_Mexico_Bars…     52.5          -1.98 0.325 -6.10 1.05e- 9 1.68e- 6
## 6 Siderastrea_Mexico_Bars…     28.5          -3.02 0.506 -5.97 2.35e- 9 3.13e- 6
# save rlogged and csv files
saveRDS(DE_LC_RC_pvalue, file = "Sid_Host_LC_RC_pval.05.rds")
write.csv(DE_LC_RC_pvalue, "Sid_Host_LC_RC_pval.05.csv")
dim(DE_LC_RC_pvalue)
## [1] 1284    7

low pH vs. ambient pH, but not comparative sites

“Ojo.Ojo”, “Reef.Control”

p-adj < 0.1

results_OO_RC_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Reef.Control"), alpha = 0.1)
head(results_OO_RC_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Reef.Control 
## Wald test p-value: group Ojo.Ojo vs Reef.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      -1.212477
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116       0.301156
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500      -0.933556
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688       0.698158
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677       0.584244
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096       0.128063
##                                                        lfcSE      stat
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.112066 -1.090293
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.609106  0.494422
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.600713 -1.554079
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.708357  0.985602
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.721157  0.810147
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.541213  0.236623
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  0.275584  0.590527
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.621008  0.838319
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.120166  0.382550
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.324329  0.630140
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.417856  0.706617
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.812950  0.935688
summary(results_OO_RC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 332, 4.2%
## LFC < 0 (down)     : 603, 7.5%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_OO_RC_.1_df <- data.frame(results_OO_RC_.1)

DE_OO_RC_padj.1 <- res_OO_RC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_RC_padj.1)
## # A tibble: 6 x 7
##   gene                    baseMean log2FoldChange lfcSE  stat   pvalue      padj
##   <chr>                      <dbl>          <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1 Siderastrea_Mexico_Bar…     64.9          -2.63 0.363 -7.23 4.71e-13   3.77e-9
## 2 Siderastrea_Mexico_Bar…     28.5           2.69 0.399  6.74 1.55e-11   4.35e-8
## 3 Siderastrea_Mexico_Bar…     28.5          -3.72 0.552 -6.74 1.63e-11   4.35e-8
## 4 Siderastrea_Mexico_Bar…    206.           -2.26 0.343 -6.57 5.13e-11   1.03e-7
## 5 Siderastrea_Mexico_Bar…     13.4          -4.65 0.779 -5.98 2.29e- 9   2.79e-6
## 6 Siderastrea_Mexico_Bar…     32.9          -3.94 0.660 -5.97 2.44e- 9   2.79e-6
# save rlogged file
saveRDS(DE_OO_RC_padj.1, file = "Sid_Host_DE_genes_OO_RC_padj.1.rds")
write.csv(DE_OO_RC_padj.1, "Sid_Host_DE_genes_OO_RC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_OO_RC_padj.1 = row.names(results_OO_RC_.1[results_OO_RC_.1$padj<0.1 & !(is.na(results_OO_RC_.1$padj)) & results_OO_RC_.1$log2FoldChange>0,])

up.results_OO_RC_padj.1 <- as.data.frame(up.results_OO_RC_padj.1)
up.results_OO_RC_padj.1$DEG <- 'Up'

down.results_OO_RC_padj.1 = row.names(results_OO_RC_.1[results_OO_RC_.1$padj<0.1 & !(is.na(results_OO_RC_.1$padj)) & results_OO_RC_.1$log2FoldChange<0,])

down.results_OO_RC_padj.1 <- as.data.frame(down.results_OO_RC_padj.1)
down.results_OO_RC_padj.1$DEG <- 'Down'

OO_RC_padj.1_summary <- merge(up.results_OO_RC_padj.1, down.results_OO_RC_padj.1, all=T)
dim(OO_RC_padj.1_summary )
## [1] 935   3
# save file
write.csv(OO_RC_padj.1_summary , file="Sid_Host_DE_genes_OO_RC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_RC_pvalue <- res_OO_RC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_RC_pvalue)
## # A tibble: 6 x 7
##   gene                    baseMean log2FoldChange lfcSE  stat   pvalue      padj
##   <chr>                      <dbl>          <dbl> <dbl> <dbl>    <dbl>     <dbl>
## 1 Siderastrea_Mexico_Bar…     64.9          -2.63 0.363 -7.23 4.71e-13   3.77e-9
## 2 Siderastrea_Mexico_Bar…     28.5           2.69 0.399  6.74 1.55e-11   4.35e-8
## 3 Siderastrea_Mexico_Bar…     28.5          -3.72 0.552 -6.74 1.63e-11   4.35e-8
## 4 Siderastrea_Mexico_Bar…    206.           -2.26 0.343 -6.57 5.13e-11   1.03e-7
## 5 Siderastrea_Mexico_Bar…     42.2           5.04 0.842  5.98 2.18e- 9   2.79e-6
## 6 Siderastrea_Mexico_Bar…     13.4          -4.65 0.779 -5.98 2.29e- 9   2.79e-6
# save rlogged and csv files
saveRDS(DE_OO_RC_pvalue, file = "Sid_Host_OO_RC_pval.05.rds")
write.csv(DE_OO_RC_pvalue, "Sid_Host_OO_RC_pval.05.csv")
dim(DE_OO_RC_pvalue)
## [1] 1683    7

low pH vs. ambient pH, but not comparative sites

“Ojo.Ojo”, “Lagoon.Control”

p-adj < 0.1

results_OO_LC_.1 <- results(dds, contrast = c("group", "Ojo.Ojo", "Lagoon.Control"), alpha = 0.1)
head(results_OO_LC_.1)
## log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Control 
## Wald test p-value: group Ojo.Ojo vs Lagoon.Control 
## DataFrame with 6 rows and 6 columns
##                                                     baseMean log2FoldChange
##                                                    <numeric>      <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002   1.47002      1.7969475
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005   4.21116     -0.0300051
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009   6.81500      0.7191713
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010   5.30688      1.1604645
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011   4.21677      0.1869740
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  12.58096     -0.6231615
##                                                        lfcSE       stat
##                                                    <numeric>  <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  1.389743  1.2930071
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.650476 -0.0461279
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.694440  1.0356137
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.784161  1.4798796
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.769355  0.2430270
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.570883 -1.0915744
##                                                       pvalue      padj
##                                                    <numeric> <numeric>
## Siderastrea_Mexico_Barshis-Radice_Host_contig00002  0.196009  0.972846
## Siderastrea_Mexico_Barshis-Radice_Host_contig00005  0.963208  0.999661
## Siderastrea_Mexico_Barshis-Radice_Host_contig00009  0.300382  0.996682
## Siderastrea_Mexico_Barshis-Radice_Host_contig00010  0.138905  0.959800
## Siderastrea_Mexico_Barshis-Radice_Host_contig00011  0.807984  0.996682
## Siderastrea_Mexico_Barshis-Radice_Host_contig00014  0.275020  0.996682
summary(results_OO_LC_.1)
## 
## out of 8000 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 10, 0.12%
## LFC < 0 (down)     : 9, 0.11%
## outliers [1]       : 3, 0.037%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# convert to dataframe (from DESeq2 object)
res_OO_LC_.1_df <- data.frame(results_OO_LC_.1)

DE_OO_LC_padj.1 <- res_OO_LC_.1_df %>% 
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(padj < 0.1) %>% 
    arrange(padj)

head(DE_OO_LC_padj.1)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat    pvalue    padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1 Siderastrea_Mexico_Bars…    43.3            4.53 0.850  5.33   9.57e-8 7.66e-4
## 2 Siderastrea_Mexico_Bars…    47.5            1.81 0.375  4.82   1.40e-6 5.61e-3
## 3 Siderastrea_Mexico_Bars…     9.81           2.83 0.607  4.65   3.30e-6 8.79e-3
## 4 Siderastrea_Mexico_Bars…   615.            -5.75 1.25  -4.58   4.54e-6 9.08e-3
## 5 Siderastrea_Mexico_Bars…     3.14           4.56 1.04   4.37   1.26e-5 2.02e-2
## 6 Siderastrea_Mexico_Bars…    28.5            1.78 0.419  4.23   2.29e-5 3.05e-2
# save rlogged file
saveRDS(DE_OO_LC_padj.1, file = "Sid_Host_DE_genes_OO_LC_padj.1.rds")
write.csv(DE_OO_LC_padj.1, "Sid_Host_DE_genes_OO_LC_padj.1.csv")

Pairwise contrast: up- vs. down-regulated genes

up.results_OO_LC_padj.1 = row.names(results_OO_LC_.1[results_OO_LC_.1$padj<0.1 & !(is.na(results_OO_LC_.1$padj)) & results_OO_LC_.1$log2FoldChange>0,])

up.results_OO_LC_padj.1 <- as.data.frame(up.results_OO_LC_padj.1)
up.results_OO_LC_padj.1$DEG <- 'Up'

down.results_OO_LC_padj.1 = row.names(results_OO_LC_.1[results_OO_LC_.1$padj<0.1 & !(is.na(results_OO_LC_.1$padj)) & results_OO_LC_.1$log2FoldChange<0,])

down.results_OO_LC_padj.1 <- as.data.frame(down.results_OO_LC_padj.1)
down.results_OO_LC_padj.1$DEG <- 'Down'

OO_LC_padj.1_summary <- merge(up.results_OO_LC_padj.1, down.results_OO_LC_padj.1, all=T)
dim(OO_LC_padj.1_summary)
## [1] 19  3
# save file
write.csv(OO_LC_padj.1_summary, file="Sid_Host_DE_genes_OO_LC_padj.1_summary.csv")

arrange by lowest p-values

This is later used as input for PERMANOVA

DE_OO_LC_pvalue <- res_OO_LC_.1_df %>%
  rownames_to_column('gene') %>% 
  as_tibble() %>%
  filter(pvalue < 0.05) %>% 
    arrange(pvalue)

head(DE_OO_LC_pvalue)
## # A tibble: 6 x 7
##   gene                     baseMean log2FoldChange lfcSE  stat    pvalue    padj
##   <chr>                       <dbl>          <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1 Siderastrea_Mexico_Bars…    43.3            4.53 0.850  5.33   9.57e-8 7.66e-4
## 2 Siderastrea_Mexico_Bars…    47.5            1.81 0.375  4.82   1.40e-6 5.61e-3
## 3 Siderastrea_Mexico_Bars…     9.81           2.83 0.607  4.65   3.30e-6 8.79e-3
## 4 Siderastrea_Mexico_Bars…   615.            -5.75 1.25  -4.58   4.54e-6 9.08e-3
## 5 Siderastrea_Mexico_Bars…     3.14           4.56 1.04   4.37   1.26e-5 2.02e-2
## 6 Siderastrea_Mexico_Bars…    28.5            1.78 0.419  4.23   2.29e-5 3.05e-2
# save rlogged and csv files
saveRDS(DE_OO_LC_pvalue, file = "Sid_Host_OO_LC_pval.05.rds")
write.csv(DE_OO_LC_pvalue, "Sid_Host_OO_LC_pval.05.csv")
dim(DE_OO_LC_pvalue)
## [1] 480   7
mcols(results_OO_LC_.1)$description
## [1] "mean of normalized counts for all samples"              
## [2] "log2 fold change (MLE): group Ojo.Ojo vs Lagoon.Control"
## [3] "standard error: group Ojo.Ojo vs Lagoon.Control"        
## [4] "Wald statistic: group Ojo.Ojo vs Lagoon.Control"        
## [5] "Wald test p-value: group Ojo.Ojo vs Lagoon.Control"     
## [6] "BH adjusted p-values"

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] affy_1.66.0                 locfit_1.5-9.4             
##  [3] arrayQualityMetrics_3.44.0  DEGreport_1.24.1           
##  [5] DESeq2_1.28.1               SummarizedExperiment_1.18.2
##  [7] DelayedArray_0.14.1         matrixStats_0.58.0         
##  [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
## [11] IRanges_2.22.2              S4Vectors_0.26.1           
## [13] BiocManager_1.30.10         data.table_1.14.0          
## [15] gridExtra_2.3               knitr_1.31                 
## [17] plyr_1.8.6                  vsn_3.56.0                 
## [19] Biobase_2.48.0              BiocGenerics_0.34.0        
## [21] devtools_2.3.2              usethis_2.0.1              
## [23] ggrepel_0.9.1               cowplot_1.1.1              
## [25] vegan_2.5-7                 lattice_0.20-41            
## [27] permute_0.9-5               forcats_0.5.1              
## [29] stringr_1.4.0               dplyr_1.0.4                
## [31] purrr_0.3.4                 readr_1.4.0                
## [33] tidyr_1.1.2                 tibble_3.1.0               
## [35] ggplot2_3.3.3               tidyverse_1.3.0            
## [37] ComplexHeatmap_2.4.3        gplots_3.1.1               
## [39] pheatmap_1.0.12             circlize_0.4.12            
## [41] RColorBrewer_1.1-2         
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.1.4                  tidyselect_1.1.0           
##   [3] beadarray_2.38.0            htmlwidgets_1.5.3          
##   [5] RSQLite_2.2.3               AnnotationDbi_1.50.3       
##   [7] BiocParallel_1.22.0         munsell_0.5.0              
##   [9] codetools_0.2-18            preprocessCore_1.50.0      
##  [11] withr_2.4.1                 colorspace_2.0-0           
##  [13] highr_0.8                   rstudioapi_0.13            
##  [15] setRNG_2013.9-1             labeling_0.4.2             
##  [17] lasso2_1.2-21.1             GenomeInfoDbData_1.2.3     
##  [19] hwriter_1.3.2               mnormt_2.0.2               
##  [21] farver_2.0.3                bit64_4.0.5                
##  [23] rprojroot_2.0.2             vctrs_0.3.6                
##  [25] generics_0.1.0              xfun_0.21                  
##  [27] R6_2.5.0                    illuminaio_0.30.0          
##  [29] clue_0.3-58                 gridSVG_1.7-2              
##  [31] bitops_1.0-6                cachem_1.0.4               
##  [33] reshape_0.8.8               assertthat_0.2.1           
##  [35] scales_1.1.1                nnet_7.3-15                
##  [37] gtable_0.3.0                processx_3.4.5             
##  [39] rlang_0.4.10                genefilter_1.70.0          
##  [41] systemfonts_1.0.1           GlobalOptions_0.1.2        
##  [43] splines_4.0.3               hexbin_1.28.2              
##  [45] checkmate_2.0.0             broom_0.7.5                
##  [47] reshape2_1.4.4              yaml_2.2.1                 
##  [49] modelr_0.1.8                backports_1.2.1            
##  [51] Hmisc_4.4-2                 tools_4.0.3                
##  [53] psych_2.0.12                logging_0.10-108           
##  [55] affyio_1.58.0               ellipsis_0.3.1             
##  [57] jquerylib_0.1.3             ggdendro_0.1.22            
##  [59] sessioninfo_1.1.1           Rcpp_1.0.6                 
##  [61] base64enc_0.1-3             zlibbioc_1.34.0            
##  [63] RCurl_1.98-1.2              ps_1.5.0                   
##  [65] prettyunits_1.1.1           openssl_1.4.3              
##  [67] rpart_4.1-15                GetoptLong_1.0.5           
##  [69] haven_2.3.1                 cluster_2.1.1              
##  [71] fs_1.5.0                    magrittr_2.0.1             
##  [73] reprex_1.0.0                tmvnsim_1.0-2              
##  [75] pkgload_1.2.0               hms_1.0.0                  
##  [77] evaluate_0.14               xtable_1.8-4               
##  [79] XML_3.99-0.5                jpeg_0.1-8.1               
##  [81] gcrma_2.60.0                readxl_1.3.1               
##  [83] shape_1.4.5                 testthat_3.0.2             
##  [85] compiler_4.0.3              KernSmooth_2.23-18         
##  [87] crayon_1.4.1                htmltools_0.5.1.1          
##  [89] mgcv_1.8-34                 Formula_1.2-4              
##  [91] geneplotter_1.66.0          lubridate_1.7.10           
##  [93] DBI_1.1.1                   dbplyr_2.1.0               
##  [95] MASS_7.3-53.1               Matrix_1.3-2               
##  [97] cli_2.3.1                   pkgconfig_2.0.3            
##  [99] foreign_0.8-81              xml2_1.3.2                 
## [101] svglite_1.2.3.2             annotate_1.66.0            
## [103] BeadDataPackR_1.40.0        bslib_0.2.4                
## [105] affyPLM_1.64.0              XVector_0.28.0             
## [107] rvest_0.3.6                 callr_3.5.1                
## [109] digest_0.6.27               ConsensusClusterPlus_1.52.0
## [111] Biostrings_2.56.0           base64_2.0                 
## [113] rmarkdown_2.7               cellranger_1.1.0           
## [115] htmlTable_2.1.0             edgeR_3.30.3               
## [117] gdtools_0.2.3               gtools_3.8.2               
## [119] rjson_0.2.20                lifecycle_1.0.0            
## [121] nlme_3.1-152                jsonlite_1.7.2             
## [123] askpass_1.1                 desc_1.2.0                 
## [125] limma_3.44.3                fansi_0.4.2                
## [127] pillar_1.5.0                Nozzle.R1_1.1-1            
## [129] fastmap_1.1.0               httr_1.4.2                 
## [131] pkgbuild_1.2.0              survival_3.2-7             
## [133] glue_1.4.2                  remotes_2.2.0              
## [135] png_0.1-7                   bit_4.0.4                  
## [137] stringi_1.5.3               sass_0.3.1                 
## [139] blob_1.2.1                  latticeExtra_0.6-29        
## [141] caTools_1.18.1              memoise_2.0.0